You are allowed and encouraged to work with one partner on this project. Include your names, perm numbers, and whether you are taking the class for 131 or 231 credit.
You are welcome and encouraged to write up your report as a research paper (e.g. abstract, introduction, methods, results, conclusion) as long as you address each of the questions below. Alternatively, you can format the assignment like a long homework by addressing each question in parts.
There should be no raw R output in the paper body! All of your results should be formatted in a professional and visually appealing manner: either as a polished visualization or, for tabular data, a nicely formatted table (see the documentation for the kable and kableExtra packages). If you feel you must include extensive raw R output, it should go in an appendix, not the main report.
All R code should be available from your Rmarkdown file, but does not need to be shown in the body of the report! Use the chunk option echo=FALSE to exclude code from appearing in your writeup. In addition to your Rmarkdown, you should turn in the writeup as either a pdf document or an html file (both are acceptable).
Predicting voter behavior is complicated for many reasons despite the tremendous effort in collecting, analyzing, and understanding many available datasets. For our final project, we will analyze the 2016 presidential election dataset.
The presidential election in 2012 did not come as a surprise. Some, including Nate Silver, predicted the outcome correctly, and many speculated about his approach.
Despite the success in 2012, the 2016 presidential election came as a big surprise to many, and it was a clear example that even the current state-of-the-art technology can surprise us.
Answer the following questions in one paragraph for each.
1. What makes voter behavior prediction (and thus election forecasting) a hard problem?
Polling data can only be collected before the election actually happens, and voters' opinions often change as election day approaches, whether because of a particularly successful campaign ad or a candidate's scandal exposed days before the election. These factors are difficult to measure, so up until election day the polls are never 100 percent accurate. Sampling error can also occur when researchers inadvertently poll more supporters of one candidate than are represented in the general population. Furthermore, voters might not answer truthfully: someone who fears being attacked by those around them for revealing their true political opinion might not speak their mind until the day of the election. On top of that, not everyone who answers a survey will turn out to vote.
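The sampling-error point can be made concrete with a quick back-of-the-envelope calculation. The numbers below are made up for illustration: even a perfectly unbiased poll of 1,000 voters carries a margin of error of roughly three percentage points.

```r
# Hypothetical numbers, for illustration only: sampling margin of error for a
# simple random poll, using the normal approximation to the binomial.
p_true <- 0.52   # assumed true support (made up)
n      <- 1000   # poll sample size
moe    <- 1.96 * sqrt(p_true * (1 - p_true) / n)
round(100 * moe, 1)   # margin of error in percentage points: about 3.1
```

A 52% poll result is therefore statistically indistinguishable from a 49% one, before any of the non-sampling problems above even enter the picture.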
2. What was unique to Nate Silver’s approach in 2012 that allowed him to achieve good predictions?
Instead of looking only at the single most likely outcome, Nate Silver considered the full range of probabilities. Each day he formed a prior belief about how much support a candidate would have; the next day, when new polling data arrived, he updated his model after observing the actual data.
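This daily prior-to-posterior updating can be sketched with a minimal Beta-Binomial example. This is illustrative only, not Silver's actual model, and the poll numbers are made up:

```r
# A minimal Beta-Binomial sketch of daily Bayesian updating (illustrative only;
# not Silver's actual model, and the poll numbers are made up).
prior_a <- 50; prior_b <- 50            # prior: support centered at 50%
poll_yes <- 540; poll_n <- 1000         # today's poll: 540 of 1000 support

post_a <- prior_a + poll_yes            # conjugate Beta update
post_b <- prior_b + (poll_n - poll_yes)
post_mean <- post_a / (post_a + post_b) # updated estimate of support
post_mean                               # ~0.536; tomorrow this posterior
                                        # becomes the new prior
```

Repeating this each day lets the model carry forward uncertainty rather than committing to a single point estimate.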
3. What went wrong in 2016? What do you think should be done to make future predictions better?
The biggest issues with the 2016 election prediction were a large shift in voters' preferences during the final days before the election, an overrepresentation of college graduates in the polls that was never adjusted for, and many Trump supporters not revealing their preference to pollsters until election day. One way to improve predictions is to ask voters about their past voting behavior: we can combine a voter's self-reported intention with their voting history to determine whether they will actually vote in the end.
## read data and convert candidate from string to factor
election.raw <- read_delim("data/election/election.csv", delim = ",") %>% mutate(candidate=as.factor(candidate))
census_meta <- read_delim("data/census/metadata.csv", delim = ";", col_names = FALSE)
census <- read_delim("data/census/census.csv", delim = ",")
The meaning of each column in election.raw is clear except fips. The acronym is short for Federal Information Processing Standard.
In our dataset, fips values denote the area (US, state, or county) that each row of data represents. For example, the fips value 6037 denotes Los Angeles County.
kable(election.raw %>% filter(county == "Los Angeles County")) %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width=FALSE)
| county | fips | candidate | state | votes |
|---|---|---|---|---|
| Los Angeles County | 6037 | Hillary Clinton | CA | 2464364 |
| Los Angeles County | 6037 | Donald Trump | CA | 769743 |
| Los Angeles County | 6037 | Gary Johnson | CA | 88968 |
| Los Angeles County | 6037 | Jill Stein | CA | 76465 |
| Los Angeles County | 6037 | Gloria La Riva | CA | 21993 |
Some rows in election.raw are summary rows and these rows have county value of NA. There are two kinds of summary rows:
Federal-level summary rows have a fips value of US. State-level summary rows have a fips value equal to the state abbreviation.
4. Report the dimension of election.raw after removing rows with fips=2000. Provide a reason for excluding them. Please make sure to use the same name election.raw before and after removing those observations.
# removing rows with fips = 2000, reassigning so the name election.raw is kept
election.raw <- election.raw %>% filter(fips != "2000")
election.raw
## # A tibble: 18,345 x 5
## county fips candidate state votes
## <chr> <chr> <fct> <chr> <int>
## 1 <NA> US Donald Trump US 62984825
## 2 <NA> US Hillary Clinton US 65853516
## 3 <NA> US Gary Johnson US 4489221
## 4 <NA> US Jill Stein US 1429596
## 5 <NA> US Evan McMullin US 510002
## 6 <NA> US Darrell Castle US 186545
## 7 <NA> US Gloria La Riva US 74117
## 8 <NA> US Rocky De La Fuente US 33010
## 9 <NA> US " None of these candidates" US 28863
## 10 <NA> US Richard Duncan US 24235
## # ... with 18,335 more rows
The dimension of election.raw is now 18,345 x 5. We filtered out fips = 2000 because numeric fips values should represent county-level data, yet these rows have missing values in the county column, so they are neither valid county observations nor proper summary rows.
Following is the first few rows of the census data:
kable(census %>% head, "html") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width=FALSE) %>% scroll_box(width = "100%")
| CensusTract | State | County | TotalPop | Men | Women | Hispanic | White | Black | Native | Asian | Pacific | Citizen | Income | IncomeErr | IncomePerCap | IncomePerCapErr | Poverty | ChildPoverty | Professional | Service | Office | Construction | Production | Drive | Carpool | Transit | Walk | OtherTransp | WorkAtHome | MeanCommute | Employed | PrivateWork | PublicWork | SelfEmployed | FamilyWork | Unemployment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1001020100 | Alabama | Autauga | 1948 | 940 | 1008 | 0.9 | 87.4 | 7.7 | 0.3 | 0.6 | 0.0 | 1503 | 61838 | 11900 | 25713 | 4548 | 8.1 | 8.4 | 34.7 | 17.0 | 21.3 | 11.9 | 15.2 | 90.2 | 4.8 | 0 | 0.5 | 2.3 | 2.1 | 25.0 | 943 | 77.1 | 18.3 | 4.6 | 0 | 5.4 |
| 1001020200 | Alabama | Autauga | 2156 | 1059 | 1097 | 0.8 | 40.4 | 53.3 | 0.0 | 2.3 | 0.0 | 1662 | 32303 | 13538 | 18021 | 2474 | 25.5 | 40.3 | 22.3 | 24.7 | 21.5 | 9.4 | 22.0 | 86.3 | 13.1 | 0 | 0.0 | 0.7 | 0.0 | 23.4 | 753 | 77.0 | 16.9 | 6.1 | 0 | 13.3 |
| 1001020300 | Alabama | Autauga | 2968 | 1364 | 1604 | 0.0 | 74.5 | 18.6 | 0.5 | 1.4 | 0.3 | 2335 | 44922 | 5629 | 20689 | 2817 | 12.7 | 19.7 | 31.4 | 24.9 | 22.1 | 9.2 | 12.4 | 94.8 | 2.8 | 0 | 0.0 | 0.0 | 2.5 | 19.6 | 1373 | 64.1 | 23.6 | 12.3 | 0 | 6.2 |
| 1001020400 | Alabama | Autauga | 4423 | 2172 | 2251 | 10.5 | 82.8 | 3.7 | 1.6 | 0.0 | 0.0 | 3306 | 54329 | 7003 | 24125 | 2870 | 2.1 | 1.6 | 27.0 | 20.8 | 27.0 | 8.7 | 16.4 | 86.6 | 9.1 | 0 | 0.0 | 2.6 | 1.6 | 25.3 | 1782 | 75.7 | 21.2 | 3.1 | 0 | 10.8 |
| 1001020500 | Alabama | Autauga | 10763 | 4922 | 5841 | 0.7 | 68.5 | 24.8 | 0.0 | 3.8 | 0.0 | 7666 | 51965 | 6935 | 27526 | 2813 | 11.4 | 17.5 | 49.6 | 14.2 | 18.2 | 2.1 | 15.8 | 88.0 | 10.5 | 0 | 0.0 | 0.6 | 0.9 | 24.8 | 5037 | 67.1 | 27.6 | 5.3 | 0 | 4.2 |
| 1001020600 | Alabama | Autauga | 3851 | 1787 | 2064 | 13.1 | 72.9 | 11.9 | 0.0 | 0.0 | 0.0 | 2642 | 63092 | 9585 | 30480 | 7550 | 14.4 | 21.9 | 24.2 | 17.5 | 35.4 | 7.9 | 14.9 | 82.7 | 6.9 | 0 | 0.0 | 6.0 | 4.5 | 19.8 | 1560 | 79.4 | 14.7 | 5.8 | 0 | 10.9 |
Column information is given in metadata.
5. Remove summary rows from election.raw data: i.e.,
Federal-level summary goes into election_federal.
State-level summary goes into election_state.
Only county-level data remains in election.
election_federal <- election.raw %>% filter(fips == "US")
election_state <- election.raw %>% filter(fips == state & fips != "US")
election <- election.raw %>% filter(!is.na(county))
6. How many named presidential candidates were there in the 2016 election? Draw a bar chart of all votes received by each candidate. You can split this into multiple plots, or you may prefer to plot the results on a log scale. Either way, the results should be clear and legible!
# list of candidates
kable(election_federal$candidate, "html", caption = "List of Candidates") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width=FALSE) %>% scroll_box(width = "100%")
# count the number of candidates
# length(unique(election_federal$candidate))
election_federal %>%
filter(candidate != " None of these candidates") %>%
ggplot(aes(candidate, votes)) +
scale_y_log10() +
geom_bar(stat = "identity") +
ggtitle("Votes received by each candidate on a log scale") +
theme(axis.text.x = element_text(angle = 90, hjust = 1), plot.title = element_text(hjust = 0.5))
There were 31 named presidential candidates in the 2016 election.
7. Create variables county_winner and state_winner by taking the candidate with the highest proportion of votes. Hint: to create county_winner, start with election, group by fips, compute total votes, and pct = votes/total. Then choose the highest row using top_n (variable state_winner is similar).
state_winner <- election_state %>% group_by(fips, add = TRUE) %>% mutate(total = sum(votes)) %>% mutate(pct = votes/total) %>% top_n(1, pct)
county_winner <- election %>% group_by(fips, add = TRUE) %>% mutate(total = sum(votes)) %>% mutate(pct = votes/total) %>% top_n(1, pct)
Visualization is crucial for gaining insight and intuition during data mining. Here we will plot our data on maps of the US.
The R package ggplot2 can be used to draw maps. Consider the following code.
states <- map_data("state")
ggplot(data = states) +
geom_polygon(aes(x = long, y = lat, fill = region, group = group), color = "white") +
coord_fixed(1.3) +
guides(fill=FALSE) # color legend is unnecessary and takes too long
The variable states contains the information needed to draw the white-bordered polygons, and fill colors are determined by region.
8. Draw a county-level map by creating counties = map_data("county"). Color by county.
county <- map_data("county")
ggplot(data = county) +
geom_polygon(aes(x = long, y = lat, fill = subregion, group = group), color = "white") +
coord_fixed(1.3) +
guides(fill=FALSE) # color legend is unnecessary and takes too long
9. Now color the map by the winning candidate for each state. First, combine states variable and state_winner we created earlier using left_join(). Note that left_join() needs to match up values of states to join the tables. A call to left_join() takes all the values from the first table and looks for matches in the second table. If it finds a match, it adds the data from the second table; if not, it adds missing values:
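A toy example (made-up tables and values, not the election data) of this matching behavior:

```r
library(dplyr)

# Toy tables (made-up values) showing left_join()'s matching behavior.
x <- tibble(state = c("AZ", "CA"), winner = c("A", "B"))
y <- tibble(state = c("CA", "NY"), votes = c(100, 200))
res <- left_join(x, y, by = "state")
res
# AZ is kept with votes = NA (no match in y); CA matches and gains votes = 100;
# NY is dropped, because left_join() keeps only the rows of the first table.
```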
Here, we’ll be combining the two datasets based on state name. However, the state names are in different formats in the two tables: e.g. AZ vs. arizona. Before using left_join(), create a common column by creating a new column for states named fips = state.abb[match(some_column, some_function(state.name))]. Replace some_column and some_function to complete creation of this new column. Then left_join(). Your figure will look similar to the state-level New York Times map.
states <- states %>% mutate(fips = state.abb[match(region, tolower(state.name))])
states <- left_join(states, state_winner)
# color the map by the winning candidate for each state
ggplot(data = states) +
geom_polygon(aes(x = long, y = lat, fill = candidate, group = group), color = "white") +
coord_fixed(1.3)
10. The variable county does not have a fips column, so we will create one by pooling information from maps::county.fips. Split the polyname column into region and subregion. Use left_join() to combine county.fips with county. Also, left_join() the previously created variable county_winner. Your figure will look similar to the county-level New York Times map.
county.fips <- maps::county.fips
county.fips <- county.fips %>% separate(polyname, c("region", "subregion"), sep = ",")
county <- left_join(county, county.fips)
county_winner$fips <- as.numeric(county_winner$fips)
county <- left_join(county, county_winner)
# color the map by the winning candidate for each county
ggplot(data = county) +
geom_polygon(aes(x = long, y = lat, fill = candidate, group = group), color = "white") +
coord_fixed(1.3)
11. Create a visualization of your choice using census data. Many exit polls noted that demographics played a big role in the election. Use this Washington Post article and this R graph gallery for ideas and inspiration.
census.explore <- census %>%
filter(complete.cases(census[setdiff(names(census), 'CensusTract')])) %>%
mutate(Hispanic.num = Hispanic/100 * TotalPop,
White.num = White/100 * TotalPop,
Black.num = Black/100 * TotalPop,
Native.num = Native/100 * TotalPop,
Asian.num = Asian/100 * TotalPop,
Pacific.num = Pacific/100 *TotalPop) %>%
select(CensusTract:Women, Hispanic.num:Pacific.num) %>%
group_by(State) %>%
summarize_at(vars(TotalPop:Pacific.num), funs(sum(.))) %>%
gather(key = 'race_population', value = 'population', Hispanic.num:Pacific.num) %>%
separate(race_population, c("race", "num")) %>%
select(-num)
ggplot(census.explore, aes(State, population, fill = race)) +
geom_bar(position = "stack", stat = "identity") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
ggtitle("Population in each state by race")
12. The census data contains high-resolution information (more fine-grained than county-level). In this problem, we aggregate the information into county-level data by computing the TotalPop-weighted average of each attribute for each county. Create the following variables:
census.del: start with census, filter out any rows with missing values, convert {Men, Employed, Citizen} attributes to percentages (meta data seems to be inaccurate), compute Minority attribute by combining {Hispanic, Black, Native, Asian, Pacific}, remove these variables after creating Minority, remove {Walk, PublicWork, Construction}.
census.del <- census %>%
filter(complete.cases(census[setdiff(names(census), 'CensusTract')])) %>%
mutate(Men = Men / TotalPop * 100,
Employed = Employed / TotalPop * 100,
Citizen = Citizen / TotalPop * 100) %>%
mutate(Minority = Hispanic + Black + Native + Asian + Pacific) %>%
select(-c(Women, Hispanic, Black, Native, Asian, Pacific, Walk, PublicWork, Construction))
census.subct: start with census.del from above, group_by() two attributes {State, County}, use add_tally() to compute CountyTotal. Also, compute the weight by TotalPop/CountyTotal.
census.subct <- census.del %>%
group_by(State, County, add = TRUE) %>%
add_tally(TotalPop) %>%
mutate(CountyTotal = n) %>%
select(-n) %>%
mutate(weight = TotalPop / CountyTotal)
census.ct: start with census.subct, use summarize_at() to compute the weighted sum.
census.ct <- census.subct %>%
summarize_at(vars(Men:Minority), funs(sum(. * weight)))
A few rows of census.ct:
kable(census.ct[c(100:105, 1000:1005), ] %>% head, "html") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width=FALSE) %>% scroll_box(width = "100%")
| State | County | Men | White | Citizen | Income | IncomeErr | IncomePerCap | IncomePerCapErr | Poverty | ChildPoverty | Professional | Service | Office | Production | Drive | Carpool | Transit | OtherTransp | WorkAtHome | MeanCommute | Employed | PrivateWork | SelfEmployed | FamilyWork | Unemployment | Minority |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Arizona | Gila | 49.72444 | 63.67513 | 77.48519 | 38128.94 | 7742.802 | 20583.43 | 3187.475 | 22.64903 | 33.19307 | 29.21547 | 23.52462 | 22.07370 | 9.974047 | 76.89079 | 11.73858 | 0.4016853 | 2.3503640 | 5.121422 | 21.53425 | 33.78350 | 70.47708 | 4.664701 | 0.1417869 | 13.202261 | 34.89076 |
| Arizona | Graham | 53.59692 | 51.69957 | 69.02451 | 45252.63 | 6686.589 | 17481.32 | 2676.082 | 21.57643 | 27.19014 | 29.99648 | 21.39130 | 20.94082 | 12.228340 | 76.34010 | 12.87230 | 1.6948459 | 2.3358275 | 3.910525 | 21.80545 | 31.95659 | 70.23631 | 4.140407 | 0.1225840 | 13.688962 | 47.37405 |
| Arizona | Greenlee | 52.62108 | 47.24315 | 68.55813 | 50250.27 | 8026.202 | 21994.42 | 2603.522 | 13.62131 | 19.44737 | 25.96103 | 13.10784 | 13.88864 | 15.038524 | 79.39181 | 13.07987 | 0.0000000 | 2.0490081 | 3.328073 | 19.68985 | 37.08301 | 80.31936 | 4.778854 | 0.0000000 | 10.592741 | 51.65261 |
| Arizona | La Paz | 51.71903 | 50.87192 | 68.81057 | 34854.57 | 5399.670 | 19496.09 | 3259.001 | 21.12987 | 32.03473 | 22.07872 | 20.15581 | 22.40299 | 10.369465 | 78.02055 | 11.51815 | 0.5578606 | 0.8026418 | 5.226134 | 12.36893 | 35.18919 | 59.01420 | 9.368088 | 0.0000000 | 10.631233 | 47.64221 |
| Arizona | Maricopa | 49.47870 | 56.64921 | 65.53484 | 60542.26 | 9669.632 | 27761.68 | 3879.826 | 17.26386 | 22.76252 | 35.17442 | 18.92922 | 26.92891 | 10.023129 | 76.14856 | 11.37012 | 2.4650033 | 2.6253456 | 5.772809 | 25.68234 | 46.05740 | 82.44120 | 5.762338 | 0.1545059 | 7.945101 | 41.15112 |
| Arizona | Mohave | 50.33930 | 78.21948 | 76.97800 | 39249.58 | 7123.632 | 20973.73 | 3353.569 | 19.84279 | 28.50158 | 24.53038 | 24.93882 | 27.41707 | 12.484930 | 77.40242 | 14.63040 | 0.4397705 | 2.4444572 | 3.068395 | 20.01382 | 32.88717 | 78.19054 | 6.494262 | 0.3120396 | 13.113418 | 19.59276 |
13. Run PCA for both county & sub-county level data. Save the first two principal components PC1 and PC2 into a two-column data frame, call it ct.pc and subct.pc, respectively. Discuss whether you chose to center and scale the features before running PCA and the reasons for your choice. What are the three features with the largest absolute values of the first principal component? Which features have opposite signs and what does that mean about the correlation between these features?
# County level
pr.out.ct <- prcomp(census.ct[, -c(1:2)], scale = TRUE, center = TRUE)
ct.pc <- pr.out.ct$x[, 1:2]
# Highest absolute loadings for PC1
pc1.vector.ct <- abs(pr.out.ct$rotation[, 1])
pc1.vector.ct <- sort(pc1.vector.ct, decreasing = TRUE)
kable(head(pc1.vector.ct), "html") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width=FALSE) %>% scroll_box(width = "100%")
| x | |
|---|---|
| IncomePerCap | 0.3515553 |
| ChildPoverty | 0.3436011 |
| Poverty | 0.3423083 |
| Employed | 0.3278373 |
| Income | 0.3200339 |
| Unemployment | 0.2907011 |
# Look for negative correlation in PC1
kable(pr.out.ct$rotation[, 1], "html", caption = "PC1 loadings, county level") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width = "100%")
| x | |
|---|---|
| Men | 0.0068248 |
| White | 0.2230261 |
| Citizen | 0.0046889 |
| Income | 0.3200339 |
| IncomeErr | 0.1698692 |
| IncomePerCap | 0.3515553 |
| IncomePerCapErr | 0.1945209 |
| Poverty | -0.3423083 |
| ChildPoverty | -0.3436011 |
| Professional | 0.2501644 |
| Service | -0.1817970 |
| Office | -0.0150122 |
| Production | -0.1187617 |
| Drive | -0.0941528 |
| Carpool | -0.0767026 |
| Transit | 0.0707426 |
| OtherTransp | -0.0096675 |
| WorkAtHome | 0.1748976 |
| MeanCommute | -0.0586108 |
| Employed | 0.3278373 |
| PrivateWork | 0.0564638 |
| SelfEmployed | 0.0977609 |
| FamilyWork | 0.0486661 |
| Unemployment | -0.2907011 |
| Minority | -0.2265319 |
# Subcounty level
pr.out.subct <- prcomp(census.subct[, -c(1:3)], scale = TRUE, center = TRUE)
subct.pc <- pr.out.subct$x[, 1:2]
# Highest absolute loadings for PC1
pc1.vector.subct <- abs(pr.out.subct$rotation[, 1])
pc1.vector.subct <- sort(pc1.vector.subct, decreasing = TRUE)
kable(head(pc1.vector.subct), "html") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width=FALSE) %>% scroll_box(width = "100%")
| x | |
|---|---|
| IncomePerCap | 0.3181191 |
| Professional | 0.3064354 |
| Poverty | 0.3046909 |
| Income | 0.3025096 |
| ChildPoverty | 0.2978868 |
| Service | 0.2688339 |
# Look for negative correlation in PC1
kable(pr.out.subct$rotation[, 1], "html", caption = "PC1 loadings, sub-county level") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width = "100%")
| x | |
|---|---|
| TotalPop | -0.0324186 |
| Men | -0.0172915 |
| White | -0.2404300 |
| Citizen | -0.1608628 |
| Income | -0.3025096 |
| IncomeErr | -0.1989363 |
| IncomePerCap | -0.3181191 |
| IncomePerCapErr | -0.2123317 |
| Poverty | 0.3046909 |
| ChildPoverty | 0.2978868 |
| Professional | -0.3064354 |
| Service | 0.2688339 |
| Office | 0.0138010 |
| Production | 0.2068050 |
| Drive | -0.0789278 |
| Carpool | 0.1625664 |
| Transit | 0.0573112 |
| OtherTransp | 0.0451425 |
| WorkAtHome | -0.1729726 |
| MeanCommute | -0.0099985 |
| Employed | -0.2212063 |
| PrivateWork | 0.0420216 |
| SelfEmployed | -0.0697035 |
| FamilyWork | -0.0152021 |
| Unemployment | 0.2528185 |
| Minority | 0.2420290 |
| CountyTotal | 0.0215341 |
| weight | 0.0119266 |
We chose to center and scale the features before running PCA. In general it is a good idea to scale the variables; otherwise, variables with large magnitudes would dominate the associations between the variables. In our data, the variables are not recorded on the same scale: some are percentages and some are not.
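The effect of skipping scaling can be demonstrated on toy data (made up, not the census): a variable measured in dollars swamps one measured in percent purely because of its units.

```r
# Toy illustration (made-up data) of why scaling matters in PCA: without
# scaling, the large-magnitude dollar variable dominates PC1 by units alone.
set.seed(1)
x <- data.frame(income = rnorm(100, mean = 50000, sd = 10000),  # dollars
                pct    = rnorm(100, mean = 50,    sd = 5))      # percent
unscaled <- prcomp(x, center = TRUE, scale. = FALSE)
scaled   <- prcomp(x, center = TRUE, scale. = TRUE)
abs(unscaled$rotation["income", "PC1"])  # essentially 1: income alone drives PC1
abs(scaled$rotation["income", "PC1"])    # ~0.707: equal weight after scaling
```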
The three features with the largest absolute values of the first principal component for county level data are: IncomePerCap, ChildPoverty and Poverty. The three features with the largest absolute values of the first principal component for sub-county level data are: IncomePerCap, Professional and Poverty.
Opposite-signed loadings indicate negative correlations among the underlying variables. In PC1 of the county-level data, the features with opposite (negative) signs are Poverty, ChildPoverty, Service, Office, Production, Drive, Carpool, OtherTransp, MeanCommute, Unemployment and Minority. In PC1 of the sub-county-level data, the signs are the opposite of the county-level data; this overall flip is not meaningful on its own, since the sign of a principal component is arbitrary, and the relative signs tell the same story. It means that variables like Poverty and ChildPoverty are negatively correlated with variables like Employed, Professional and Income. Another interesting negative correlation worth pointing out involves Minority: a larger minority share corresponds with lower income and employment in the area.
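The link between opposite-signed loadings and negative correlation can be checked on toy data (made up, not the census):

```r
# Toy check (made-up data, not the census) that negatively correlated features
# receive opposite-signed PC1 loadings.
set.seed(1)
income  <- rnorm(200)
poverty <- -0.8 * income + rnorm(200, sd = 0.3)  # strongly negatively related
pc <- prcomp(cbind(income, poverty), center = TRUE, scale. = TRUE)
sign(pc$rotation["income", "PC1"]) != sign(pc$rotation["poverty", "PC1"])  # TRUE
```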
14. Determine the minimum number of PCs needed to capture 90% of the variance for both the county and sub-county analyses. Plot the proportion of variance explained (PVE) and cumulative PVE for both county and sub-county analyses.
# County level
pr.var.ct <- pr.out.ct$sdev ^ 2
pve.ct <- pr.var.ct/sum(pr.var.ct)
cumulative_pve.ct <- cumsum(pve.ct)
# minimum number of PCs needed to capture 90% of the variance
# min(which(cumulative_pve.ct >= 0.90))
## This will put the next two plots side by side
par(mfrow=c(1, 2))
## Plot proportion of variance explained
plot(pve.ct, type="l", lwd=3, main = "PVE for County")
plot(cumulative_pve.ct, type="l", lwd=3, main = "Cumulative PVE for County")
#Sub-county level
pr.var.subct <- pr.out.subct$sdev ^ 2
pve.subct <- pr.var.subct/sum(pr.var.subct)
cumulative_pve.subct <- cumsum(pve.subct)
## This will put the next two plots side by side
par(mfrow=c(1, 2))
# minimum number of PCs needed to capture 90% of the variance
# min(which(cumulative_pve.subct >= 0.90))
## Plot proportion of variance explained
plot(pve.subct, type="l", lwd=3, main = "PVE for Sub-County")
plot(cumulative_pve.subct, type="l", lwd=3, main = "Cumulative PVE for Sub-County")
Minimum number of PCs needed to capture 90% of the variance for the county is 13 and for the sub-county is 17.
15. With census.ct, perform hierarchical clustering with complete linkage. Cut the tree to partition the observations into 10 clusters. Re-run the hierarchical clustering algorithm using the first 2 principal components from ct.pc as inputs instead of the original features. Compare and contrast the results. For both approaches, investigate the cluster that contains San Mateo County. Which approach seemed to put San Mateo County in a more appropriate cluster? Comment on what you observe and discuss possible explanations for these observations.
set.seed(1)
# hierarchical clustering for census.ct
census.ct.scaled <- as.data.frame(scale(census.ct[, -c(1:2)]))
dis.census.ct <- dist(census.ct.scaled, method="euclidean")
census.ct.hclust <- hclust(dis.census.ct, method = "complete")
census.ct.hclust <- cutree(census.ct.hclust, k = 10)
# hierarchical clustering for ct.pc
ct.pc.scaled <- as.data.frame(scale(ct.pc))
dis.ct.pc <- dist(ct.pc.scaled, method="euclidean")
ct.pc.hclust <- hclust(dis.ct.pc, method = "complete")
ct.pc.hclust <- cutree(ct.pc.hclust, k = 10)
kable(census.ct %>% filter(County == "San Mateo"), "html", caption = "San Mateo County") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width = "100%")
| State | County | Men | White | Citizen | Income | IncomeErr | IncomePerCap | IncomePerCapErr | Poverty | ChildPoverty | Professional | Service | Office | Production | Drive | Carpool | Transit | OtherTransp | WorkAtHome | MeanCommute | Employed | PrivateWork | SelfEmployed | FamilyWork | Unemployment | Minority |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| California | San Mateo | 49.19773 | 40.63851 | 64.2005 | 100369.9 | 16123.02 | 47881.29 | 6115.552 | 8.011122 | 9.705514 | 45.73565 | 18.28979 | 22.304 | 7.343291 | 69.92713 | 10.68144 | 9.257082 | 2.598808 | 5.077957 | 26.82681 | 51.72497 | 79.76635 | 8.367532 | 0.1716192 | 6.689483 | 55.53405 |
# Samples within the cluster containing "San Mateo"
# census.ct.hclust[which(census.ct$County == "San Mateo")]
samples.census.ct <- census.ct[which(census.ct.hclust == 7), ]
kable(summary(samples.census.ct), "html", caption = "Sample summary for cluster 7") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width=FALSE) %>% scroll_box(width = "100%")
| State | County | Men | White | Citizen | Income | IncomeErr | IncomePerCap | IncomePerCapErr | Poverty | ChildPoverty | Professional | Service | Office | Production | Drive | Carpool | Transit | OtherTransp | WorkAtHome | MeanCommute | Employed | PrivateWork | SelfEmployed | FamilyWork | Unemployment | Minority | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Length:115 | Length:115 | Min. :46.77 | Min. :28.17 | Min. :58.75 | Min. : 36373 | Min. : 7151 | Min. :22991 | Min. :3117 | Min. : 2.739 | Min. : 0.000 | Min. :26.51 | Min. : 5.00 | Min. :11.97 | Min. : 1.519 | Min. :36.14 | Min. : 0.000 | Min. : 0.000 | Min. : 0.000 | Min. : 1.400 | Min. : 6.30 | Min. :40.96 | Min. :52.80 | Min. : 2.917 | Min. :0.00000 | Min. : 1.400 | Min. : 4.223 | |
| Class :character | Class :character | 1st Qu.:48.74 | 1st Qu.:60.90 | 1st Qu.:67.19 | 1st Qu.: 74500 | 1st Qu.:10803 | 1st Qu.:35306 | 1st Qu.:4245 | 1st Qu.: 5.997 | 1st Qu.: 7.105 | 1st Qu.:40.52 | 1st Qu.:13.40 | 1st Qu.:21.33 | 1st Qu.: 6.327 | 1st Qu.:68.65 | 1st Qu.: 7.145 | 1st Qu.: 1.188 | 1st Qu.: 1.002 | 1st Qu.: 4.580 | 1st Qu.:24.47 | 1st Qu.:48.94 | 1st Qu.:72.71 | 1st Qu.: 4.610 | 1st Qu.:0.09446 | 1st Qu.: 4.668 | 1st Qu.:13.998 | |
| Mode :character | Mode :character | Median :49.18 | Median :72.84 | Median :70.51 | Median : 84360 | Median :12273 | Median :38039 | Median :4784 | Median : 7.581 | Median : 9.686 | Median :44.17 | Median :15.29 | Median :23.51 | Median : 7.761 | Median :77.85 | Median : 7.979 | Median : 3.510 | Median : 1.349 | Median : 5.388 | Median :29.03 | Median :51.21 | Median :79.29 | Median : 5.712 | Median :0.13250 | Median : 6.023 | Median :24.117 | |
| NA | NA | Mean :49.65 | Mean :69.73 | Mean :70.35 | Mean : 84333 | Mean :12740 | Mean :39527 | Mean :5206 | Mean : 8.781 | Mean :10.490 | Mean :45.11 | Mean :15.80 | Mean :22.73 | Mean : 8.045 | Mean :73.83 | Mean : 8.254 | Mean : 6.819 | Mean : 1.828 | Mean : 5.786 | Mean :28.04 | Mean :51.30 | Mean :77.08 | Mean : 6.439 | Mean :0.14357 | Mean : 6.254 | Mean :27.763 | |
| NA | NA | 3rd Qu.:49.83 | 3rd Qu.:83.64 | 3rd Qu.:74.16 | 3rd Qu.: 93631 | 3rd Qu.:14238 | 3rd Qu.:43391 | 3rd Qu.:5627 | 3rd Qu.:10.833 | 3rd Qu.:12.059 | 3rd Qu.:50.07 | 3rd Qu.:17.87 | 3rd Qu.:24.97 | 3rd Qu.: 9.556 | 3rd Qu.:81.75 | 3rd Qu.: 9.190 | 3rd Qu.: 9.026 | 3rd Qu.: 2.048 | 3rd Qu.: 6.472 | 3rd Qu.:32.02 | 3rd Qu.:53.15 | 3rd Qu.:82.71 | 3rd Qu.: 6.667 | 3rd Qu.:0.16662 | 3rd Qu.: 7.567 | 3rd Qu.:36.628 | |
| NA | NA | Max. :60.06 | Max. :93.99 | Max. :79.25 | Max. :128268 | Max. :23381 | Max. :65599 | Max. :9699 | Max. :26.606 | Max. :32.357 | Max. :74.22 | Max. :25.95 | Max. :26.44 | Max. :16.100 | Max. :91.30 | Max. :16.000 | Max. :39.424 | Max. :10.238 | Max. :15.900 | Max. :42.70 | Max. :64.08 | Max. :86.55 | Max. :22.737 | Max. :0.80204 | Max. :14.866 | Max. :69.604 |
# ct.pc.hclust[which(census.ct$County == "San Mateo")]
samples.ct.cp <- census.ct[which(ct.pc.hclust == 4), ]
kable(summary(samples.ct.cp), "html", caption = "Sample summary for cluster 4") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width=FALSE) %>% scroll_box(width = "100%")
| State | County | Men | White | Citizen | Income | IncomeErr | IncomePerCap | IncomePerCapErr | Poverty | ChildPoverty | Professional | Service | Office | Production | Drive | Carpool | Transit | OtherTransp | WorkAtHome | MeanCommute | Employed | PrivateWork | SelfEmployed | FamilyWork | Unemployment | Minority | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Length:657 | Length:657 | Min. :45.84 | Min. :14.30 | Min. :53.72 | Min. : 36373 | Min. : 2934 | Min. :21335 | Min. :1540 | Min. : 1.50 | Min. : 0.00 | Min. :20.32 | Min. : 5.00 | Min. :12.73 | Min. : 2.40 | Min. :32.13 | Min. : 0.000 | Min. : 0.0000 | Min. : 0.0000 | Min. : 1.262 | Min. : 9.707 | Min. :38.03 | Min. :53.42 | Min. : 2.608 | Min. :0.00000 | Min. : 0.900 | Min. : 0.4824 | |
| Class :character | Class :character | 1st Qu.:48.95 | 1st Qu.:77.32 | 1st Qu.:72.15 | 1st Qu.: 54879 | 1st Qu.: 7006 | 1st Qu.:27695 | 1st Qu.:3038 | 1st Qu.: 8.20 | 1st Qu.: 9.99 | 1st Qu.:32.23 | 1st Qu.:14.90 | 1st Qu.:21.69 | 1st Qu.: 9.16 | 1st Qu.:77.07 | 1st Qu.: 7.815 | 1st Qu.: 0.2067 | 1st Qu.: 0.9023 | 1st Qu.: 3.708 | 1st Qu.:20.055 | 1st Qu.:47.71 | 1st Qu.:74.13 | 1st Qu.: 4.920 | 1st Qu.:0.09495 | 1st Qu.: 4.182 | 1st Qu.: 5.0390 | |
| Mode :character | Mode :character | Median :49.47 | Median :87.82 | Median :74.79 | Median : 60656 | Median : 8535 | Median :29922 | Median :3696 | Median :10.15 | Median :12.85 | Median :35.56 | Median :16.37 | Median :23.61 | Median :12.17 | Median :80.74 | Median : 8.887 | Median : 0.5721 | Median : 1.2320 | Median : 4.783 | Median :23.728 | Median :49.70 | Median :78.32 | Median : 6.473 | Median :0.15132 | Median : 5.562 | Median :10.5435 | |
| NA | NA | Mean :49.64 | Mean :82.99 | Mean :74.00 | Mean : 64355 | Mean : 8860 | Mean :31192 | Mean :3785 | Mean :10.31 | Mean :13.02 | Mean :36.57 | Mean :16.48 | Mean :23.26 | Mean :12.82 | Mean :79.62 | Mean : 9.139 | Mean : 1.8255 | Mean : 1.4183 | Mean : 5.090 | Mean :24.129 | Mean :49.77 | Mean :77.64 | Mean : 7.161 | Mean :0.20117 | Mean : 5.589 | Mean :15.0869 | |
| NA | NA | 3rd Qu.:50.13 | 3rd Qu.:93.61 | 3rd Qu.:76.86 | 3rd Qu.: 71308 | 3rd Qu.:10435 | 3rd Qu.:33528 | 3rd Qu.:4303 | 3rd Qu.:12.01 | 3rd Qu.:16.18 | 3rd Qu.:39.94 | 3rd Qu.:17.98 | 3rd Qu.:25.12 | 3rd Qu.:15.81 | 3rd Qu.:83.50 | 3rd Qu.:10.107 | 3rd Qu.: 1.7033 | 3rd Qu.: 1.7044 | 3rd Qu.: 6.121 | 3rd Qu.:27.805 | 3rd Qu.:51.96 | 3rd Qu.:81.92 | 3rd Qu.: 8.650 | 3rd Qu.:0.23676 | 3rd Qu.: 6.963 | 3rd Qu.:20.2654 | |
| NA | NA | Max. :55.51 | Max. :99.17 | Max. :86.43 | Max. :128268 | Max. :23381 | Max. :51353 | Max. :9699 | Max. :23.49 | Max. :27.62 | Max. :60.07 | Max. :26.05 | Max. :31.20 | Max. :33.17 | Max. :91.30 | Max. :22.400 | Max. :51.5809 | Max. :10.2377 | Max. :12.895 | Max. :42.770 | Max. :62.79 | Max. :87.94 | Max. :21.500 | Max. :1.85733 | Max. :11.235 | Max. :83.3101 |
Hierarchical clustering using census.ct seems like the better approach. From the PCA analysis, we already know that three features play an important role in the county-level data: IncomePerCap, ChildPoverty, and Poverty. Using the census.ct data, San Mateo falls in cluster 7; using the ct.pc data, it falls in cluster 4. We then compared the mean of IncomePerCap, ChildPoverty, and Poverty within each cluster to San Mateo's actual values. The samples in cluster 7 (census.ct) have means closer to San Mateo than the samples in cluster 4 (ct.pc), and the same holds for the other variables. So cluster 7 from census.ct appears to be the better fit for San Mateo. One explanation is that important cluster separation can take place along dimensions with weak variance, so reducing dimensions before clustering is not always better.
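The cluster-mean comparison described above can be sketched in base R. The snippet below runs on a small simulated data set; the data, the 10-cluster cut, and the "San Mateo" row are stand-ins for the project's actual census.ct objects, shown only to make the mechanics concrete:

```r
set.seed(1)
# Toy stand-in for county-level data: 60 rows, the three key features
toy <- data.frame(
  IncomePerCap = rnorm(60, mean = 30000, sd = 5000),
  ChildPoverty = rnorm(60, mean = 16, sd = 5),
  Poverty      = rnorm(60, mean = 12, sd = 4)
)
# Hierarchical clustering on scaled features, cut into 10 clusters
hc <- hclust(dist(scale(toy)), method = "complete")
cl <- cutree(hc, k = 10)
# Mean of each feature within each cluster
cluster.means <- aggregate(toy, by = list(cluster = cl), FUN = mean)
# Compare the target county's values to the means of its own cluster
target <- toy[1, ]          # row 1 plays the role of San Mateo
own.cl <- cl[1]
abs(cluster.means[cluster.means$cluster == own.cl, -1] - target)
```

The same pattern (aggregate by cluster label, then take absolute differences from the target county's row) is what we applied to both the census.ct and ct.pc clusterings.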
In order to train classification models, we need to combine the county_winner and census.ct data. This seemingly straightforward task is harder than it sounds. The following code makes the necessary changes to merge them into election.cl for classification.
tmpwinner <- county_winner %>% ungroup %>%
mutate(state = state.name[match(state, state.abb)]) %>% ## state abbreviations
mutate_at(vars(state, county), tolower) %>% ## to all lowercase
mutate(county = gsub(" county| columbia| city| parish", "", county)) ## remove suffixes
tmpcensus <- census.ct %>% ungroup %>%
mutate_at(vars(State, County), tolower)
election.cl <- tmpwinner %>%
left_join(tmpcensus, by = c("state"="State", "county"="County")) %>%
na.omit
## save meta information
election.meta <- election.cl %>% select(c(county, fips, state, votes, pct, total))
## save predictors and class labels
election.cl = election.cl %>% select(-c(county, fips, state, votes, pct, total))
Using the following code, partition data into 80% training and 20% testing:
set.seed(10)
n <- nrow(election.cl)
in.trn <- sample.int(n, 0.8*n)
trn.cl <- election.cl[ in.trn,]
tst.cl <- election.cl[-in.trn,]
Using the following code, define 10 cross-validation folds:
set.seed(20)
nfold <- 10
folds <- sample(cut(1:nrow(trn.cl), breaks=nfold, labels=FALSE))
Using the following error rate function:
calc_error_rate = function(predicted.value, true.value){
return(mean(true.value!=predicted.value))
}
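As a quick sanity check of the error-rate function (redefined here so the snippet is self-contained), two wrong predictions out of four should give an error rate of 0.5:

```r
calc_error_rate = function(predicted.value, true.value){
  return(mean(true.value != predicted.value))
}
# Two of the four predictions disagree with the truth -> 0.5
calc_error_rate(c("A", "A", "B", "B"), c("A", "B", "A", "B"))
```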
records = matrix(NA, nrow=5, ncol=2)
colnames(records) = c("train.error","test.error")
rownames(records) = c("tree","logistic","lasso", "random forest", "knn")
16. Decision tree: train a decision tree by cv.tree(). Prune the tree to minimize misclassification error. Be sure to use the folds from above for cross-validation. Visualize the trees before and after pruning. Save training and test errors to the records variable. Interpret and discuss the results of the decision tree analysis. Use this plot to tell a story about voting behavior in the US (remember the NYT infographic?)
# Fit model on training set
tree.election <- tree(candidate ~., data = trn.cl)
# Plot the tree
plot(tree.election)
text(tree.election, pretty = 0, cex = .5, col = "red")
title("Classification Tree Before Pruning")
# Set random seed
set.seed(69)
# K-Fold cross validation
cv <- cv.tree(tree.election, rand = folds, FUN = prune.misclass)
# Print out cv
# cv
# Best size
best.cv <- cv$size[which(cv$dev == min(cv$dev))]
best.size.cv <- min(best.cv)
# Prune tree.election
tree.election.pruned <- prune.misclass(tree.election, best = best.size.cv)
# Plot the tree
plot(tree.election.pruned)
text(tree.election.pruned, pretty = 0, cex = .6, col = "red")
title("Classification Tree After Pruning")
# Predict on train set and calculate train error
pred.tree.train <- predict(tree.election.pruned, trn.cl, type="class")
tree_train_err <- calc_error_rate(pred.tree.train, trn.cl$candidate)
# Predict on test set and calculate test error
pred.tree.test <- predict(tree.election.pruned, tst.cl, type="class")
tree_test_err <- calc_error_rate(pred.tree.test, tst.cl$candidate)
records[1, 1] <- tree_train_err
records[1, 2] <- tree_test_err
kable(records, caption = "Records", "html") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width=FALSE)
| | train.error | test.error |
|---|---|---|
| tree | 0.0749186 | 0.0830619 |
| logistic | NA | NA |
| lasso | NA | NA |
| random forest | NA | NA |
| knn | NA | NA |
The decision tree model gave us a 0.075 error rate on the training set and a slightly higher 0.083 on the test set. Overall, it did a good job. From the pruned classification tree, we can see that counties that rely more on public transportation tend to vote for Hillary Clinton; these are typically urban areas. In addition, the majority of counties with a white demographic majority voted for Trump. Among counties where the white population is below roughly 48 percent, low-income counties voted for Clinton, as did counties with high unemployment rates. This matches what we already know: Democrats usually do better among minority groups and in high-poverty areas.
17. Run a logistic regression to predict the winning candidate in each county. Save training and test errors to the records variable. What are the significant variables? Are they consistent with what you saw in the decision tree analysis? Interpret the meaning of a couple of the significant coefficients in terms of a unit change in the variables.
glm.fit <- glm(candidate ~.,
data = trn.cl, family=binomial)
# Summarize the logistic regression model
kable(glm.fit$coefficients, caption = "Logistic regression coefficients", "html") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width = "100%")
| | Coefficient |
|---|---|
| (Intercept) | -11.5774984 |
| Men | 0.0858465 |
| White | -0.2216640 |
| Citizen | 0.1018512 |
| Income | -0.0000758 |
| IncomeErr | -0.0000388 |
| IncomePerCap | 0.0002676 |
| IncomePerCapErr | -0.0002785 |
| Poverty | 0.0199018 |
| ChildPoverty | -0.0074088 |
| Professional | 0.2816889 |
| Service | 0.3658750 |
| Office | 0.1051116 |
| Production | 0.1844630 |
| Drive | -0.2562350 |
| Carpool | -0.2462987 |
| Transit | -0.0086047 |
| OtherTransp | -0.0969895 |
| WorkAtHome | -0.2100919 |
| MeanCommute | 0.0632558 |
| Employed | 0.1652801 |
| PrivateWork | 0.0925337 |
| SelfEmployed | 0.0122624 |
| FamilyWork | -1.1913135 |
| Unemployment | 0.1837887 |
| Minority | -0.0891958 |
In the above results, White, Citizen, IncomePerCap, Professional, Service, Production, Drive, Carpool, Employed, PrivateWork, and Unemployment are statistically significant at the 0.001 level. With a logit link, each logistic regression coefficient gives the change in the log odds of the outcome for a one-unit increase in the corresponding predictor, holding the others fixed. For example:
The variable White has a coefficient of -0.2217: for every one-unit increase in White, the log odds of voting for Clinton (versus Trump) decrease by 0.2217, holding other variables fixed. This is consistent with what we established in the decision tree model: predominantly white counties tended to vote for Trump.
The variable Unemployment has a coefficient of 0.1838: for a one-unit increase in Unemployment, the log odds of voting for Clinton (versus Trump) increase by 0.1838, holding other variables fixed. This is also consistent with the decision tree, where high-unemployment counties leaned toward Clinton.
Some other variables, such as IncomePerCap, are also consistent with the decision tree's conclusions: Clinton was more popular than Trump in poorer areas. However, the decision tree used Transit as an important splitting feature, while in our logistic model Transit is not statistically significant once the other variables are held fixed.
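Log-odds coefficients are easier to read as odds ratios. Using the two estimates discussed above (values copied from the coefficient table; this is an illustration, not a refit of the model):

```r
# Coefficients reported by the fitted logistic model above
coefs <- c(White = -0.2216640, Unemployment = 0.1837887)
# Odds ratio per one-unit increase in each predictor
odds.ratios <- exp(coefs)
round(odds.ratios, 3)
# White: ~0.801, so each additional percentage point of white population
#   multiplies the odds of a Clinton win by about 0.80 (a ~20% decrease)
# Unemployment: ~1.202, so each point of unemployment raises the odds ~20%
```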
# predict training and testing datasets
prob.training <- predict(glm.fit, newdata = trn.cl, type="response")
prob.training <- round(prob.training, digits=2)
prob.test <- predict(glm.fit, newdata = tst.cl, type="response")
prob.test <- round(prob.test, digits=2)
# Save the predicted labels using 0.5 as a threshold
pred.logistic.training <- as.factor(ifelse(prob.training<=0.5, "Donald Trump", "Hillary Clinton"))
pred.logistic.test <- as.factor(ifelse(prob.test<=0.5, "Donald Trump", "Hillary Clinton"))
# training error
logistic_training_error <-
calc_error_rate(pred.logistic.training, droplevels(trn.cl$candidate, except = c("Donald Trump", "Hillary Clinton")))
# testing error
logistic_testing_error <-
calc_error_rate(pred.logistic.test, droplevels(tst.cl$candidate, except = c("Donald Trump", "Hillary Clinton")))
# pass in to records
records[2, 1] <- logistic_training_error
records[2, 2] <- logistic_testing_error
kable(records, caption = "Records", "html") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width=FALSE)
| train.error | test.error | |
|---|---|---|
| tree | 0.0749186 | 0.0830619 |
| logistic | 0.0671824 | 0.0749186 |
| lasso | NA | NA |
| random forest | NA | NA |
| knn | NA | NA |
18. You may notice that you get a warning glm.fit: fitted probabilities numerically 0 or 1 occurred. As we discussed in class, this is an indication that we have perfect separation (some linear combination of variables perfectly predicts the winner). This is usually a sign that we are overfitting. One way to control overfitting in logistic regression is through regularization. Use the cv.glmnet function from the glmnet library to run K-fold cross validation and select the best regularization parameter for the logistic regression with LASSO penalty. Reminder: set alpha=1 to run LASSO regression, set lambda = c(1, 5, 10, 50) * 1e-4 in cv.glmnet() to set pre-defined candidate values for the tuning parameter \(\lambda\). This is because the default candidate values of \(\lambda\) in cv.glmnet() are relatively too large for our dataset, so we use pre-defined candidate values. What is the optimal value of \(\lambda\) in cross validation? What are the non-zero coefficients in the LASSO regression for the optimal value of \(\lambda\)? How do they compare to the unpenalized logistic regression? Save training and test errors to the records variable.
set.seed(1)
cv.out.lasso <- cv.glmnet(trn.cl %>%
select(-candidate) %>%
as.matrix, trn.cl$candidate %>%
droplevels(except = c("Donald Trump", "Hillary Clinton")), alpha = 1, lambda = c(1, 5, 10, 50) * 1e-4, nfolds = 10, family="binomial")
bestlam <- cv.out.lasso$lambda.min
# bestlam
lasso.mod <- glmnet(trn.cl %>%
select(-candidate) %>%
as.matrix, trn.cl$candidate %>%
droplevels(except = c("Donald Trump", "Hillary Clinton")), alpha = 1, lambda = bestlam, family="binomial")
# predict training and testing datasets
lasso.pred.training <- predict(lasso.mod, newx = trn.cl %>% select(-candidate) %>% as.matrix, type = "class")
lasso.pred.testing <- predict(lasso.mod, newx = tst.cl %>% select(-candidate) %>% as.matrix, type = "class")
# training error
lasso_training_error <-
calc_error_rate(lasso.pred.training, droplevels(trn.cl$candidate, except = c("Donald Trump", "Hillary Clinton")))
# testing error
lasso_testing_error <-
calc_error_rate(lasso.pred.testing, droplevels(tst.cl$candidate, except = c("Donald Trump", "Hillary Clinton")))
# pass in to records
records[3, 1] <- lasso_training_error
records[3, 2] <- lasso_testing_error
kable(records, caption = "Records", "html") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width=FALSE)
| | train.error | test.error |
|---|---|---|
| tree | 0.0749186 | 0.0830619 |
| logistic | 0.0671824 | 0.0749186 |
| lasso | 0.0663681 | 0.0781759 |
| random forest | NA | NA |
| knn | NA | NA |
lasso.coef <- predict(lasso.mod, type="coefficients")[1:26,]
kable(lasso.coef, caption = "LASSO coefficients", "html") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width = "100%")
| | Coefficient |
|---|---|
| (Intercept) | -20.4054760 |
| Men | 0.0491127 |
| White | -0.1238268 |
| Citizen | 0.1111130 |
| Income | -0.0000465 |
| IncomeErr | -0.0000466 |
| IncomePerCap | 0.0001982 |
| IncomePerCapErr | -0.0002012 |
| Poverty | 0.0149537 |
| ChildPoverty | 0.0000000 |
| Professional | 0.2553687 |
| Service | 0.3352589 |
| Office | 0.0779562 |
| Production | 0.1497450 |
| Drive | -0.2083447 |
| Carpool | -0.1936119 |
| Transit | 0.0417913 |
| OtherTransp | -0.0393694 |
| WorkAtHome | -0.1516355 |
| MeanCommute | 0.0435754 |
| Employed | 0.1563779 |
| PrivateWork | 0.0848988 |
| SelfEmployed | 0.0000000 |
| FamilyWork | -1.0274680 |
| Unemployment | 0.1746343 |
| Minority | 0.0000000 |
The optimal value of \(\lambda\) in cross validation is 5e-04. The non-zero coefficients in the LASSO regression include every variable except SelfEmployed, ChildPoverty, and Minority. Most of the LASSO coefficients are shrunk toward zero relative to the unpenalized logistic estimates; White, for example, shrinks from -0.222 to -0.124, roughly half its unpenalized magnitude.
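The shrinkage is easiest to see side by side. The snippet below uses a few estimates copied from the two coefficient tables above (an illustration of the comparison, not a refit):

```r
comparison <- data.frame(
  variable = c("White", "Professional", "Drive", "FamilyWork"),
  logistic = c(-0.2216640, 0.2816889, -0.2562350, -1.1913135),
  lasso    = c(-0.1238268, 0.2553687, -0.2083447, -1.0274680)
)
# LASSO pulls every estimate toward zero; the ratio shows how strongly,
# and it varies by variable (White is nearly halved, others less so)
comparison$shrinkage <- round(comparison$lasso / comparison$logistic, 2)
comparison
```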
19. Compute ROC curves for the decision tree, logistic regression and LASSO logistic regression using predictions on the test data. Display them on the same plot. Based on your classification results, discuss the pros and cons of the various methods. Are the different classifiers more appropriate for answering different kinds of questions about the election?
# Find the testing probabilities
prob.tree.test <- predict(tree.election.pruned, tst.cl, type="vector")
prob.logistic.test <- predict(glm.fit, tst.cl, type="response")
prob.lasso.test <- predict(lasso.mod, newx = tst.cl %>% select(-candidate) %>% as.matrix, type = "response")
# First argument is the predicted probabilities, second is the true labels
pred_tree <- prediction(prob.tree.test[, 13], droplevels(tst.cl$candidate, except = c("Donald Trump", "Hillary Clinton")))
pred_logistic <- prediction(prob.logistic.test, droplevels(tst.cl$candidate, except = c("Donald Trump", "Hillary Clinton")))
pred_lasso <- prediction(prob.lasso.test, droplevels(tst.cl$candidate, except = c("Donald Trump", "Hillary Clinton")))
# Secondly, we calculate the True Positive Rate and False Positive Rate by `performance()`.
# We want TPR on the y axis and FPR on the x axis
perf_tree <- performance(pred_tree, measure="tpr", x.measure="fpr")
perf_logistic <- performance(pred_logistic, measure="tpr", x.measure="fpr")
perf_lasso <- performance(pred_lasso, measure="tpr", x.measure="fpr")
# Plot the ROC curve
plot(perf_tree, col=2, lwd=2, main="ROC curve")
abline(0,1)
par(new=TRUE)
plot(perf_logistic, col=6, lwd=2)
par(new=TRUE)
plot(perf_lasso, col=3, lwd=2)
# Calculate AUC
auc.tree <- performance(pred_tree, "auc")@y.values
auc.logistic <- performance(pred_logistic, "auc")@y.values
auc.lasso <- performance(pred_lasso, "auc")@y.values
# auc.tree
# auc.logistic
# auc.lasso
From the ROC curves, the LASSO model's AUC is 0.966, larger than the AUCs of the tree and unpenalized logistic models (0.917 and 0.960, respectively). An AUC above 0.95 generally indicates good classification, so LASSO and logistic regression classify better than the decision tree here; by regularizing the logistic model with the LASSO penalty we reduce overfitting on the training data and therefore predict better on the test data. The decision tree gives a clear visualization of the splits on important factors, while the LASSO and logistic models better quantify each variable's importance with the others held fixed. The latter answers the question of which factors most affect voters' choices, all else being equal.
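To make the AUC numbers above concrete: AUC is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. A minimal base-R illustration on toy scores (not the project's predictions) using the rank (Wilcoxon) formulation:

```r
# Toy scores and labels: a higher score should indicate the positive class
scores <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1)
labels <- c(1,   1,   0,   1,   0,   0,   1,   0)
# AUC via ranks: equivalent to the Mann-Whitney / Wilcoxon statistic
auc <- function(scores, labels){
  r  <- rank(scores)
  n1 <- sum(labels == 1)
  n0 <- sum(labels == 0)
  (sum(r[labels == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}
auc(scores, labels)  # -> 0.75 for this toy example
```

ROCR's `performance(pred, "auc")` computes the same quantity from the prediction objects used above.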
20. This is an open question. Interpret and discuss any overall insights gained in this analysis and possible explanations. Use any tools at your disposal to make your case: visualize errors on the map, discuss what does/doesn't seem reasonable based on your understanding of these methods, propose possible directions (collecting additional data, domain knowledge, etc). In addition, propose and tackle at least one more interesting question. Creative and thoughtful analyses will be rewarded! This part will be worth up to 20% of your final project grade!
We combine county and tmpcensus into election_map using left_join, and then use our earlier decision tree and logistic regression models to predict on election_map. election_map has all the observations from county and contains all the information needed to draw a U.S. map.
county_map <- county %>% ungroup %>%
mutate(state = state.name[match(state, state.abb)]) %>% ## state abbreviations
mutate_at(vars(state, county), tolower) %>% ## to all lowercase
mutate(county = gsub(" county| columbia| city| parish", "", county)) ## remove suffixes
election_map <- county_map %>%
left_join(tmpcensus, by = c("state"="State", "county"="County")) %>%
na.omit
# Decision tree model error
pred.tree.map <- predict(tree.election.pruned, election_map, type="class")
error.map <- election_map %>% mutate(prediction = ifelse(election_map$candidate != pred.tree.map, "incorrect", "correct"))
ggplot(data = error.map) +
geom_polygon(aes(x = long, y = lat, fill = prediction, group = group), color = "white") +
coord_fixed(1.3) +
ggtitle("Prediction outcome for decision tree")
# Logistic model error
prob.logistic.map <- predict(glm.fit, election_map, type="response")
pred.logistic.map <- as.factor(ifelse(prob.logistic.map<=0.5, "Donald Trump", "Hillary Clinton"))
error.map <- election_map %>% mutate(prediction = ifelse(droplevels(election_map$candidate, except = c("Donald Trump", "Hillary Clinton")) != pred.logistic.map, "incorrect", "correct"))
ggplot(data = error.map) +
geom_polygon(aes(x = long, y = lat, fill = prediction, group = group), color = "white") +
coord_fixed(1.3) +
ggtitle("Prediction outcome for logistic regression")
As we can see above, the logistic model makes fewer incorrect predictions than the decision tree model. It is also worth noticing that most of the improvements come from the western part of the country.
One thing that seems odd is the Transit variable in the decision tree classification. We later found that the Income/IncomePerCap, Unemployment, and White variables used by the decision tree are also significantly important in the other classification methods. Transit, however, shows no significance in the logistic or LASSO regressions. One possible explanation is that although Transit on its own has no meaningful impact on predicting voters' preference, it is strongly correlated with other variables. We could explore this further by adding interaction terms to the regression model, which would let us test more hypotheses about the effect of Transit at different values of the other variables and potentially improve prediction.
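The interaction idea can be sketched on simulated data. The variable names Transit and White mirror the census columns, but everything below is synthetic: the outcome is constructed so that Transit matters only through its interaction with White, which is exactly the situation that a main-effects-only logistic model misses:

```r
set.seed(2)
n <- 500
Transit <- rnorm(n)
White   <- rnorm(n)
# Outcome depends on Transit only via the Transit:White interaction
p <- plogis(1.5 * Transit * White)
y <- rbinom(n, 1, p)
# Main-effects-only model vs. model including the interaction term
fit.main <- glm(y ~ Transit + White, family = binomial)
fit.int  <- glm(y ~ Transit * White, family = binomial)
# Compare fits: the interaction model captures the signal the other cannot
c(main = AIC(fit.main), interaction = AIC(fit.int))
```

In the project model the analogous refit would be `glm(candidate ~ . + Transit:White, ...)` on trn.cl, though whether the interaction is significant there is an empirical question we have not tested.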
In PCA, IncomePerCap, ChildPoverty, and Poverty are the three most important factors in the first principal component. However, ChildPoverty and Minority have coefficients of 0 in our LASSO regression. This might look contradictory at first, but it is likely because ChildPoverty is highly correlated with, say, Poverty, and the regression models only consider each variable with all others held constant. Likewise, even though we suspect minority share has a significant impact on voting behavior, Minority is probably highly correlated with variables such as White, so its apparent effect in the regression depends on the value of White.
ChildPoverty and Minority have negative (though insignificant) relationships with voting for Clinton. This seems strange, because we expected higher ChildPoverty or Minority rates to mean more support for Clinton. In general, we could interpret and explain our models better with a clearer understanding of U.S. politics and past voting history. Perhaps we could gather more information from past election datasets and then build more models to analyze the factors driving different voting behavior.
trn.cl$candidate <- droplevels(trn.cl$candidate, except = c("Donald Trump", "Hillary Clinton"))
rf.election <- randomForest(candidate ~ ., data=trn.cl, ntree=500, importance=TRUE)
# rf.election
plot(rf.election)
kable(importance(rf.election)%>% head, "html") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width = "100%")
| | Donald Trump | Hillary Clinton | MeanDecreaseAccuracy | MeanDecreaseGini |
|---|---|---|---|---|
| Men | 8.5451155 | 11.707915 | 13.94213 | 16.55625 |
| White | 25.4968974 | 35.972362 | 36.08114 | 84.35682 |
| Citizen | 12.3470046 | 5.343751 | 13.81806 | 15.98468 |
| Income | 12.0782016 | 15.205013 | 19.20286 | 21.34947 |
| IncomeErr | 0.3424663 | 12.007279 | 11.33028 | 14.97626 |
| IncomePerCap | 13.1941166 | 15.353247 | 19.30548 | 24.62288 |
varImpPlot(rf.election, sort=T, main="Variable Importance for Predicting Election", n.var=5)
pred.rf.train <- predict(rf.election, newdata = trn.cl)
pred.rf.test <- predict(rf.election, newdata = tst.cl)
# training error
rf_training_error <-
calc_error_rate(pred.rf.train, droplevels(trn.cl$candidate, except = c("Donald Trump", "Hillary Clinton")))
# testing error
rf_testing_error <-
calc_error_rate(pred.rf.test, droplevels(tst.cl$candidate, except = c("Donald Trump", "Hillary Clinton")))
# pass in to records
records[4, 1] <- rf_training_error
records[4, 2] <- rf_testing_error
kable(records, caption = "Records", "html") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width=FALSE)
| | train.error | test.error |
|---|---|---|
| tree | 0.0749186 | 0.0830619 |
| logistic | 0.0671824 | 0.0749186 |
| lasso | 0.0663681 | 0.0781759 |
| random forest | 0.0012215 | 0.0456026 |
| knn | NA | NA |
We see that the five most important factors are Transit, White, Minority, Professional, and Unemployment, which is fairly consistent with our other classification models. Random forest greatly improves on the earlier models, reducing the training and test error rates to 0.0012 and 0.0456, respectively.
We would also like to try the K-nearest-neighbors method for classification.
# YTrain is the true candidate labels on the training set, XTrain is the design matrix
YTrain = trn.cl$candidate
XTrain = trn.cl %>% select(-candidate)
# YTest is the true candidate labels on the test set, XTest is the design matrix
# Drop unused candidate levels so the confusion table aligns with the predictions
YTest = droplevels(tst.cl$candidate, except = c("Donald Trump", "Hillary Clinton"))
XTest = tst.cl %>% select(-candidate)
# LOOCV
validation.error = NULL
allK = 1:50
set.seed(66)
# For each number in allK, use LOOCV to find a validation error
for (i in allK){ # Loop through different number of neighbors
pred.Yval = knn.cv(train=XTrain, cl=YTrain, k=i) # Predict on the left-out validation set
validation.error = c(validation.error, mean(pred.Yval!=YTrain)) # Combine all validation errors
}
numneighbor = max(allK[validation.error == min(validation.error)])
# numneighbor
# knn on the train set
pred.YTtrain = knn(train=XTrain, test=XTrain, cl=YTrain, k=numneighbor)
conf.train = table(predicted=pred.YTtrain, true=YTrain)
knn.train.err <- 1 - sum(diag(conf.train)/sum(conf.train))
# knn on test set
pred.YTest = knn(train=XTrain, test=XTest, cl=YTrain, k=numneighbor)
conf.test = table(predicted=pred.YTest, true=YTest)
knn.test.err <- 1 - sum(diag(conf.test)/sum(conf.test))
records[5, 1] <- knn.train.err
records[5, 2] <- knn.test.err
kable(records, caption = "Records", "html") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width=FALSE)
| | train.error | test.error |
|---|---|---|
| tree | 0.0749186 | 0.0830619 |
| logistic | 0.0671824 | 0.0749186 |
| lasso | 0.0663681 | 0.0781759 |
| random forest | 0.0012215 | 0.0456026 |
| knn | 0.1307003 | 1.0000000 |
Using LOOCV, the best number of neighbors is 27. The test error of 1.00 recorded in the table is an artifact of unmatched factor levels in the test labels (tst.cl$candidate retains the levels of all candidates, so the diagonal of the confusion table misaligns); once the labels are restricted to the two major candidates, KNN gives a training error of about 0.13 and a test error of about 0.15. Even so, this is clearly not a strong classification model for our election dataset.